In [8]:
source("C:\\Work\\myRfunctions.R")
fnRunDate()
fnInstallPackages()
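myRfunctions.R itself is not shown here. As a rough sketch, under the assumption that these two helpers simply stamp the run and install/load the required packages, they might look like:

# Hypothetical sketch of the helpers sourced above -- the real
# myRfunctions.R is not part of this notebook
fnRunDate <- function() {
  # Stamp the run with the current date/time
  print(Sys.time())
}
fnInstallPackages <- function() {
  # Install any missing packages, then load them all
  pkgs <- c("mlbench", "tidyverse", "caret", "psych", "PerformanceAnalytics")
  missing <- pkgs[!pkgs %in% installed.packages()[, "Package"]]
  if (length(missing) > 0) install.packages(missing)
  invisible(lapply(pkgs, library, character.only = TRUE))
}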
In [9]:
# Load the data (BreastCancer ships with the mlbench package)
library(mlbench)
data(BreastCancer)
# convert the data frame to a tibble
dataset <- as_tibble(BreastCancer)
In [10]:
glimpse(dataset)
Apart from Id, the attributes are all factors. For modeling, it may be more useful to work with the data as numbers rather than factors.
In [11]:
# Converting columns to numeric using "tidyverse"
dataset[, 1:10] <- dplyr::mutate_if(dplyr::select(dataset, 1:10), is.factor, as.numeric)
str(dataset)
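One caveat worth knowing: as.numeric() on a factor returns the underlying level codes, not the labels. That is safe here because the BreastCancer factors keep their levels "1" through "10" in value order, but the general-purpose idiom goes through as.character():

# Level codes vs. labels: default levels sort as text ("1", "10", "5")
f <- factor(c("1", "5", "10"))
as.numeric(f)                 # 1 3 2 -- level codes, not the values
as.numeric(as.character(f))   # 1 5 10 -- the actual values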
In [12]:
# Id can't be used for prediction
dataset$Id <- NULL
In [13]:
psych::describe(dataset, check = T)
There are 13 NA values in Bare.nuclei. We may need to remove or impute those for some analyses and models; a quick check below confirms where they sit.
All attributes have integer values in the range [1, 10], so we may not see much benefit from normalizing the attributes for instance-based methods like KNN.
There is some imbalance in the Class values.
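Counting the NAs per column and the number of incomplete rows:

# NA count per column and number of incomplete rows
colSums(is.na(dataset))
sum(!complete.cases(dataset))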
In [14]:
# summarize the class distribution
fnClassDistribution(Class = dataset$Class)
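fnClassDistribution() is another helper from myRfunctions.R whose body is not shown; a base-R stand-in that reports the same kind of summary might be:

# Hypothetical stand-in for fnClassDistribution(): class counts and percentages
tbl <- table(dataset$Class)
cbind(freq = tbl, percentage = round(100 * prop.table(tbl), 1))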
Let’s look at the correlation between the attributes. We have to exclude the 13 rows with NA values (incomplete cases) when calculating the correlations.
In [15]:
# summarize correlations between input variables
options(warn = -1)  # suppress warnings from the plotting call
PerformanceAnalytics::chart.Correlation(dplyr::select_if(dataset, is.numeric), histogram=TRUE, pch=".")
I see modest to high correlation between some of the attributes, e.g. Cell.shape and Cell.size at 0.91. Some algorithms may benefit from removing the highly correlated attributes.
Almost all of the distributions have an exponential or bimodal shape. We may benefit from log transforms or other power transforms later on.
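If we do decide to drop the highly correlated attributes, caret can suggest which ones. A sketch using findCorrelation() on the complete cases (the 0.90 cutoff is an illustrative choice):

# Columns flagged for removal at a 0.90 pairwise-correlation cutoff
cor_matrix <- cor(dataset[, 1:9], use = "complete.obs")
caret::findCorrelation(cor_matrix, cutoff = 0.90, names = TRUE)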
In [191]:
# scatterplot matrix
#trellis.par.set(theme = col.whitebg(), warn = FALSE)
caret::featurePlot(x=dataset[, 1:5], y=dataset$Class, plot="ellipse")
The green (benign) points appear to be clustered around the bottom-left corner (smaller values), while the red (malignant) points are spread all over.
In [192]:
# density plots for each attribute by class value
scales <- list(x=list(relation="free"), y=list(relation="free"))
caret::featurePlot(x=dataset[, 1:9], y=dataset$Class, plot="density", scales=scales)
In [193]:
# scatterplot matrix
caret::featurePlot(x=dataset[, 1:9], y=dataset$Class, plot="box")
In [194]:
caret::featurePlot(x=dataset[, 1:9], y=dataset$Class, plot="strip", jitter = TRUE)
In [195]:
str(dataset)
In [196]:
# Split out validation dataset
# create a list of 80% of the rows in the original dataset we can use for training
set.seed(7)
validation_index <- createDataPartition(dataset$Class, p=0.80, list=FALSE)
# select 20% of the data for validation
validation <- dataset[-validation_index,]
# use the remaining 80% of the data for training and testing the models
dataset <- dataset[validation_index,]
In [197]:
formula <- Class ~ .
In [201]:
# Evaluate Algorithms
# 10-fold cross validation with 3 repeats
control <- trainControl(method="repeatedcv", number=10, repeats=3)
metric <- "Accuracy"
# Logistic Regression (GLM)
set.seed(7)
fit.glm <- train(formula, data=dataset, method="glm", preProc=c("medianImpute"), metric=metric, trControl=control, na.action=na.pass)
# LDA
set.seed(7)
fit.lda <- train(formula, data=dataset, method="lda", preProc=c("medianImpute"), metric=metric, trControl=control, na.action=na.pass)
# GLMNET
set.seed(7)
fit.glmnet <- train(formula, data=dataset, method="glmnet", preProc=c("medianImpute"), metric=metric, trControl=control, na.action=na.pass)
# KNN
set.seed(7)
fit.knn <- train(formula, data=dataset, method="knn", preProc=c("medianImpute"), metric=metric, trControl=control, na.action=na.pass)
# CART
set.seed(7)
fit.cart <- train(formula, data=dataset, method="rpart", preProc=c("medianImpute"), metric=metric, trControl=control, na.action=na.pass)
# Naive Bayes
set.seed(7)
fit.nb <- train(formula, data=dataset, method="nb", preProc=c("medianImpute"), metric=metric, trControl=control, na.action=na.pass)
# SVM
set.seed(7)
fit.svm <- train(formula, data=dataset, method="svmRadial", preProc=c("medianImpute"), metric=metric, trControl=control, na.action=na.pass)
# Compare algorithms
results <- resamples(list(LG=fit.glm, LDA=fit.lda, GLMNET=fit.glmnet, KNN=fit.knn, CART=fit.cart, NB=fit.nb, SVM=fit.svm))
summary(results)
dotplot(results)
Good accuracy across the board. All algorithms have a mean accuracy above 90%, well above the 65% baseline of always predicting benign. The problem is learnable.
Some predictors have skewed distributions. I'll try normalizing them with a Box-Cox transform.
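That 65% baseline is just the majority-class share of the training data, which is easy to verify directly:

# Baseline accuracy: always predict the majority (benign) class
max(prop.table(table(dataset$Class)))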
In [202]:
# Evaluate Algorithms Transform
# 10-fold cross validation with 3 repeats
control <- trainControl(method="repeatedcv", number=10, repeats=3)
metric <- "Accuracy"
# Logistic Regression (GLM)
set.seed(7)
fit.glm <- train(formula, data=dataset, method="glm", metric=metric, preProc=c("BoxCox","medianImpute"), trControl=control, na.action=na.pass)
# LDA
set.seed(7)
fit.lda <- train(formula, data=dataset, method="lda", metric=metric, preProc=c("BoxCox","medianImpute"), trControl=control, na.action=na.pass)
# GLMNET
set.seed(7)
fit.glmnet <- train(formula, data=dataset, method="glmnet", metric=metric, preProc=c("BoxCox","medianImpute"), trControl=control, na.action=na.pass)
# KNN
set.seed(7)
fit.knn <- train(formula, data=dataset, method="knn", metric=metric, preProc=c("BoxCox","medianImpute"), trControl=control, na.action=na.pass)
# CART
set.seed(7)
fit.cart <- train(formula, data=dataset, method="rpart", metric=metric, preProc=c("BoxCox","medianImpute"), trControl=control, na.action=na.pass)
# Naive Bayes
set.seed(7)
fit.nb <- train(formula, data=dataset, method="nb", metric=metric, preProc=c("BoxCox","medianImpute"), trControl=control, na.action=na.pass)
# SVM
set.seed(7)
fit.svm <- train(formula, data=dataset, method="svmRadial", metric=metric, preProc=c("BoxCox","medianImpute"), trControl=control, na.action=na.pass)
# Compare algorithms
transform_results <- resamples(list(LG=fit.glm, LDA=fit.lda, GLMNET=fit.glmnet, KNN=fit.knn, CART=fit.cart, NB=fit.nb, SVM=fit.svm))
summary(transform_results)
dotplot(transform_results)
There was a definite improvement using the Box-Cox transform.
The best performer was SVM, so I'll tune that algorithm next.
The SVM implementation has two parameters we can tune with the caret package: sigma, a smoothing term, and C, a cost constraint. You can learn more about these parameters in the help for the ksvm() function (?ksvm). Let's try a range of values for C between 1 and 10 and a few small values for sigma around the default of 0.1.
In [203]:
# Tune SVM
# 10-fold cross validation with 3 repeats
control <- trainControl(method="repeatedcv", number=10, repeats=3)
metric <- "Accuracy"
set.seed(7)
grid <- expand.grid(sigma=c(0.025, 0.05, 0.1, 0.15), C=seq(1, 10, by=1))
fit.svm <- train(formula, data=dataset, method="svmRadial", metric=metric, tuneGrid=grid, preProc=c("BoxCox","medianImpute"), trControl=control, na.action=na.pass)
print(fit.svm)
plot(fit.svm)
We can see that tuning made very little difference to the results. The most accurate model scored in the 97% range using sigma = 0.1 and C = 1. I could tune further, but I don't expect a payoff.
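Rather than reading the winner off the plot, caret stores it on the fitted object:

# Best parameter combination and its cross-validated accuracy
fit.svm$bestTune
max(fit.svm$results$Accuracy)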
In [204]:
# Tune kNN
# 10-fold cross validation with 3 repeats
control <- trainControl(method="repeatedcv", number=10, repeats=3)
metric <- "Accuracy"
set.seed(7)
grid <- expand.grid(k=seq(1, 20, by=1))
fit.knn <- train(formula, data=dataset, method="knn", metric=metric, tuneGrid=grid, preProc=c("BoxCox","knnImpute"), trControl=control, na.action=na.pass)
print(fit.knn)
plot(fit.knn)
Tuning made only a little difference for KNN.
In [205]:
# Evaluate ensemble methods: bagging and boosting
# 10-fold cross validation with 3 repeats
control <- trainControl(method="repeatedcv", number=10, repeats=3)
metric <- "Accuracy"
# Bagged CART
set.seed(7)
fit.treebag <- train(formula, data=dataset, method="treebag", preProc=c("medianImpute"), metric=metric, trControl=control, na.action=na.pass)
# Random Forest
set.seed(7)
fit.rf <- train(formula, data=dataset, method="rf", metric=metric, preProc=c("BoxCox","medianImpute"), trControl=control, na.action=na.pass)
# Stochastic Gradient Boosting
set.seed(7)
fit.gbm <- train(formula, data=dataset, method="gbm", metric=metric, preProc=c("BoxCox","medianImpute"), trControl=control, verbose=FALSE, na.action=na.pass)
# C5.0
set.seed(7)
fit.c50 <- train(formula, data=dataset, method="C5.0", metric=metric, preProc=c("BoxCox","medianImpute"), trControl=control, na.action=na.pass)
# Compare results
ensemble_results <- resamples(list(BAG=fit.treebag, RF=fit.rf, GBM=fit.gbm, C50=fit.c50))
summary(ensemble_results)
dotplot(ensemble_results)
In [207]:
# Estimate skill of the tuned KNN model on the unseen validation set
set.seed(13)
predictions <- predict(fit.knn, newdata=validation, na.action=na.pass)
confusionMatrix(predictions, validation$Class)
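The object returned by confusionMatrix() carries more than the printed table; the headline metrics can be extracted directly, e.g.:

# Pull the headline metrics out of the confusionMatrix object
cm <- confusionMatrix(predictions, validation$Class)
cm$overall[c("Accuracy", "Kappa")]
cm$byClass[c("Sensitivity", "Specificity")]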